{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "rl-hw-problems.ipynb",
      "version": "0.3.2",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "metadata": {
        "id": "O0fFgFRGksH6",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Installation\n",
        "\n",
        "To run this you need several packages. First of all, you need anaconda, which you most likely already have if you're viewing this through jupyter. If not then check the readme on the class page.\n",
        "\n",
        "System requirements: This should work on all operating systems (Linux, Mac, and Windows). However, several of the environments in the OpenAI-gym require additional simulators which don't aren't easy to get on Windows. In any case, it is strongly recommended that you use Linux, although you should be ok with Mac. (HINT: if you're on Windows check out the Windows Subsystem for Linux (WSL), although it'll make visualizing your policies a little tricky).\n",
        "\n",
        "Then install the following packages (using conda or pip):\n",
        "\n",
        "- pytorch --> `conda install pytorch -c pytorch`\n",
        "- gym --> `pip install gym`\n",
        "- gym (the cool environments, doesnt work on Windows) --> `pip install gym[all]`\n",
        "(When install gym[all] don't worry if the mujoco installation doesn't work. That's a more advanced 3D physics simulator that has to be set up separately (see website). Anyway, we don't need it necessarily)."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "nomKMuz0lrWL",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# If you're using colab, this will install the necessary packages!\n",
        "!pip install torch\n",
        "!pip install gym\n",
        "!wget https://pjreddie.com/media/files/rlhw_util.py"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "NzmPUAD1ksH7",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import sys, os, time\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "import torch\n",
        "import torch.nn as nn\n",
        "import torch.optim as optim\n",
        "import torch.nn.functional as F\n",
        "import torch.multiprocessing as mp\n",
        "from torch import distributions\n",
        "from torch.distributions import Categorical\n",
        "from itertools import islice\n",
        "\n",
        "import gym"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "WXry0cZLksH_",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Introduction\n",
        "Welcome to the RL playground. Your task is to implement the REINFORCE and A3C algorithm to solve various OpenAI-gym environments. If you are not familiar with OpenAI-gym, stop reading and visit https://gym.openai.com/envs/ to see all the tasks you can try to solve.\n",
        "\n",
        "In this homework, we will only look at tasks with a discrete (and small) action space. That being said, both algorithms can be modified slightly to work on tasks with continuous action spaces. For full credit you must fill in the code below so you achieve an average total reward per episode on the cartpole task (CartPole-v1) of at least 499 (for an episode length of 500) for both REINFORCE and A3C. Then you must apply your code to any one other environment in OpenAI-gym, and plot and compare the learning curves (average total reward per episode vs number of episodes trained on) between REINFORCE and A3C (where at least one of the algorithms shows significant improvement from initialization).\n",
        "\n",
        "Below there's an overview of what every iteration will look like, regardless of whether you want to train or evaluate your agent."
      ]
    },
    {
      "metadata": {
        "id": "IBY204WCksIA",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from rlhw_util import * # <-- look whats inside here - it could save you a lot of work!\n",
        "\n",
        "def run_iteration(mode, N, agent, gen, horizon=None, render=False):\n",
        "    train = mode == 'train'\n",
        "    if train:\n",
        "        agent.train()\n",
        "    else:\n",
        "        agent.eval()\n",
        "\n",
        "    states, actions, rewards = zip(*[gen(horizon=horizon, render=render) for _ in range(N)])\n",
        "\n",
        "    loss = None\n",
        "    if train:\n",
        "        loss = agent.learn(states, actions, rewards)\n",
        "\n",
        "    reward = sum([r.sum() for r in rewards]) / N\n",
        "\n",
        "    return reward, loss"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "FbOI_tXDksIC",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## The Actor\n",
        "\n",
        "We need to learn a policy which, given some state, outputs a distribution over all possible actions. As this is deep RL, we'll use a deep neural network to turn the observed state into the requisite action distribution. From this action distribution we can choose what action to take using `get_action`. Pytorch, brilliant as it is, makes our task incredibly easy, as we can use the `torch.distributions.Categorical` class for sampling.\n",
        "\n",
        "You can experiment with all sorts of network architectures, but remember this is RL, not image classification on ImageNet, so you probably won't need a very deep network (HINT: look below at the state and action dimensionality to get a feel for the task)."
      ]
    },
    {
      "metadata": {
        "id": "QkW0tUvZksID",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "class Actor(nn.Module):\n",
        "    def __init__(self, state_dim, action_dim):\n",
        "        super(Actor, self).__init__()\n",
        "        \n",
        "        # TODO: Fill in the code to define you policy\n",
        "        \n",
        "        raise NotImplementedError\n",
        "        \n",
        "    def forward(self, state):\n",
        "        \n",
        "        # TODO: Fill in the code to run a forward pass of your policy to get a distribution over actions (HINT: probabilities sum to 1)\n",
        "        \n",
        "        raise NotImplementedError\n",
        "\n",
        "    def get_policy(self, state):\n",
        "        return Categorical(self(state))\n",
        "\n",
        "    def get_action(self, state, greedy=None):\n",
        "        if greedy is None:\n",
        "            greedy = not self.training\n",
        "\n",
        "        policy = self.get_policy(state)\n",
        "        return MLE(policy) if greedy else policy.sample()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "5NYbREzFksIF",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## The REINFORCE Agent\n",
        "\n",
        "The Actor defines our policy, but we also have to define how and when we'll be updating our policy, which brings us to the agent. The agent will house the policy (an `Actor`), and can then be used to generate rollouts (using `forward()`) or update the policy given a list of rollouts (using `learn()`).\n",
        "\n",
        "The REINFORCE algorithm naively uses the returns directly to weight the gradients, however this makes the variance in the policy gradient estimation very large. As a result, we will use a baseline which is a linear model which takes in a state and outputs the return (sounds like a value function, right?). Except we're not going to train our baseline using gradient descent, instead we'll just solve the linear system analytically in every iteration, and use the solution in the next iteration. Don't worry about training/updating the baseline, but you do have to use it in the right way. (Optional experiment: try removing the baseline and see how performance changes)"
      ]
    },
    {
      "metadata": {
        "id": "nQ7mlwvLksIG",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "class REINFORCE(nn.Module):\n",
        "    \n",
        "    def __init__(self, state_dim, action_dim, discount=0.97, lr=1e-3, weight_decay=1e-4):\n",
        "        super(REINFORCE, self).__init__()\n",
        "        self.actor = Actor(state_dim, action_dim)\n",
        "        \n",
        "        self.baseline = nn.Linear(state_dim, 1)\n",
        "        \n",
        "        # TODO: create an optimizer for the parameters of your actor (HINT: use the passed in lr and weight_decay args)\n",
        "        raise NotImplementedError\n",
        "        \n",
        "        self.discount = discount\n",
        "        \n",
        "    def forward(self, state):\n",
        "        return self.actor.get_action(state)\n",
        "    \n",
        "    def learn(self, states, actions, rewards):\n",
        "        '''\n",
        "        Takes in three arguments each of which is a list with equal length. Each element in the list is a \n",
        "        pytorch tensor with 1 row for every step in the episode, and the columns are state_dim, action_dim, \n",
        "        and 1, respectively.\n",
        "        '''\n",
        "        \n",
        "        # TODO: implement the REINFORCE algorithm (HINT: check the slides/papers)\n",
        "        \n",
        "        returns = [compute_returns(rs, discount=self.discount) for rs in rewards]\n",
        "        \n",
        "        states, actions, returns = torch.cat(states), torch.cat(actions), torch.cat(returns)\n",
        "        \n",
        "        raise NotImplementedError\n",
        "        \n",
        "        error = F.mse_loss(self.baseline(states).squeeze(), returns).detach()\n",
        "        solve(states, returns, out=self.baseline)\n",
        "        #error = F.mse_loss(self.baseline(states).squeeze(), returns).detach()\n",
        "        \n",
        "        return error.item() # Returns a rough estimate of the error in the baseline (dont worry about this too much)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "ogkj6xjeksII",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## The Critic\n",
        "\n",
        "Now we can introduce a critic, which is essentially a value function to estimate the expected discounted reward of a state."
      ]
    },
    {
      "metadata": {
        "id": "uGYwOY6AksIJ",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "class Critic(nn.Module):\n",
        "    def __init__(self, state_dim):\n",
        "        super(Critic, self).__init__()\n",
        "        \n",
        "        # TODO: define your value function network\n",
        "        \n",
        "        raise NotImplementedError\n",
        "\n",
        "    def forward(self, state):\n",
        "        \n",
        "        # TODO: apply your value function network to get a value given this batch of states\n",
        "        \n",
        "        raise NotImplementedError"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "5rfkylePksIL",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## The A3C Agent\n",
        "\n",
        "Now we can put the actor and critic together using the A3C algorithm. It turns out, the tasks in the gym are all so simple that there is essentially no gain in parallelization, so technically we're implementing A2C (no async), but the RL part is the same."
      ]
    },
    {
      "metadata": {
        "id": "QoZgaABTksIL",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "class A3C(nn.Module):\n",
        "    \n",
        "    def __init__(self, state_dim, action_dim, discount=0.97, lr=1e-3, weight_decay=1e-4):\n",
        "        super(A3C, self).__init__()\n",
        "        self.actor = Actor(state_dim, action_dim)\n",
        "        self.critic = Critic(state_dim)\n",
        "        \n",
        "        # TODO: create an optimizer for the parameters of your actor (HINT: use the passed in lr and weight_decay args)\n",
        "        # (HINT: the actor and critic have different objectives, so how many optimizers do you need?)\n",
        "        raise NotImplementedError\n",
        "        \n",
        "        self.discount = discount\n",
        "        \n",
        "    def forward(self, state):\n",
        "        return self.actor.get_action(state)\n",
        "    \n",
        "    def learn(self, states, actions, rewards):\n",
        "        \n",
        "        returns = [compute_returns(rs, discount=self.discount) for rs in rewards]\n",
        "        \n",
        "        states, actions, returns = torch.cat(states), torch.cat(actions), torch.cat(returns)\n",
        "        \n",
        "        # TODO: implement A3C (HINT: algorithm details found in A3C paper supplement) \n",
        "        # (HINT2: the algorithm is actually very similar to REINFORCE, the only difference is now we have a critic, what might that do?)\n",
        "        \n",
        "        raise NotImplementedError"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Hq-zQWzAksIO",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Part 1: Balancing a pole with a cart\n",
        "\n",
        "First, we'll test both algorithms on a very simple toy system: the cartpole. Eventhough it's very low dimensional (state=4, action=2), this task is nontrival because it is underactuated. Nevertheless after a few thousand episodes our policy shouldn't have a problem! "
      ]
    },
    {
      "metadata": {
        "id": "RGBdP7leksIP",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# Optimization hyperparameters\n",
        "lr = 1e-3\n",
        "weight_decay = 1e-4"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "8vAi8sXuksIS",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "env_name = 'CartPole-v1' \n",
        "#env_name = 'LunarLander-v2'\n",
        "e = Pytorch_Gym_Env(env_name)\n",
        "state_dim = e.observation_space.shape[0]\n",
        "action_dim = e.action_space.n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "scrolled": false,
        "id": "Meq1LT8NksIV",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# Choose what agent to use\n",
        "#agent = REINFORCE(state_dim, action_dim, lr=lr, weight_decay=weight_decay)\n",
        "agent = A3C(state_dim, action_dim, lr=lr, weight_decay=weight_decay)\n",
        "\n",
        "total_episodes = 0\n",
        "print(agent) # Let's take a look at what we're working with..."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "WUl_VOypksIX",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# Create a \n",
        "gen = Generator(e, agent)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Ja001im9ksIZ",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Let's do this!!\n",
        "\n",
        "Below is the loop to train and evaluate your agent. You can play around with the number of iterations to run, and the number of rollouts per iteration. \n",
        "\n",
        "You can rerun this cell multiple times to keep training your model for more episodes. In any case, it shouldn't take more than 30 min to an 1 hour to train. (training never took me more than 5 min). HINT: Keep an eye on the eval_reward, it'll be pretty noisy, but if that should be slowly increasing."
      ]
    },
    {
      "metadata": {
        "id": "8oT0vzayksIa",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "num_iter = 100\n",
        "num_train = 10\n",
        "num_eval = 10 # dont change this\n",
        "for itr in range(num_iter):\n",
        "    #agent.model.epsilon = epsilon * epsilon_decay ** (total_episodes / epsilon_decay_episodes)\n",
        "    #print('** Iteration {}/{} **'.format(itr+1, num_iter))\n",
        "    train_reward, train_loss = run_iteration('train', num_train, agent, gen)\n",
        "    eval_reward, _ = run_iteration('eval', num_eval, agent, gen)\n",
        "    total_episodes += num_train\n",
        "    print('Ep:{}: reward={:.3f}, loss={:.3f}, eval={:.3f}'.format(total_episodes, train_reward, train_loss, eval_reward))\n",
        "    \n",
        "    if eval_reward > 499 and env_name == 'CartPole-v1': # dont change this\n",
        "        print('Success!!! You have solved cartpole task! Time for a bigger challenge!')\n",
        "    \n",
        "    # save model\n",
        "print('Done')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "sxlxf0YwksIc",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# You can visualize your policy at any time\n",
        "run_iteration('eval', 1, agent, gen, render=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "collapsed": true,
        "id": "0SrE9Kh0ksIe",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Analysis\n",
        "\n",
        "Plot the performance of each of your agents for the cartpole task and one additional task. When choosing a new environment, make sure is has a discrete action space. For each plot the x axis should show the total number of episodes the model was trained on, and the y axis shows the average total reward per episode.\n",
        "\n",
        "You can leave the plots as cell outputs below, or you can save them as images and submit them separately.\n",
        "\n",
        "### Deliverables\n",
        "- single plot showing both the REINFORCE algorithm's performance, and A3C's performance on the same plot for the cartpole environment (CartPole-v1).\n",
        "- single plot showing both the REINFORCE algorithm's performance, and A3C's performance on the same plot for a second environment of your choice (suggested -> LunarLander-v2, it's a little tricky but watching the agent fly spaceships is very entertaining!).\n",
        "- in every case you models have to learn something for full credit."
      ]
    },
    {
      "metadata": {
        "id": "90yijcV0ksIf",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}